
Conversation

@wks wks commented Oct 9, 2025

We remove OnAllocationFail and add boolean fields to AllocationOptions:

  • allow_overcommit: whether we allow overcommit
  • at_safepoint: whether this allocation is at a safepoint
  • allow_oom_call: whether to call Collection::out_of_memory

Now Space::acquire always polls before trying to get new pages. In particular, when allow_overcommit == true, polling and over-committing happen in the same allocation attempt. If we also set at_safepoint == false, the current mutator can allocate normally in this allocation, but will block for GC at the nearest safepoint. This is useful for certain VMs.
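
For reference, here is a rough sketch of what the options struct could look like, based only on the three fields listed above. The actual definition in mmtk-core (derives, extra fields, exact defaults) may differ, and the allow_oom_call default shown here is an assumption.

```rust
/// A sketch only, not the actual mmtk-core definition of AllocationOptions.
#[derive(Clone, Copy, Debug)]
pub struct AllocationOptions {
    /// Whether this allocation may exceed the specified heap size (over-commit).
    pub allow_overcommit: bool,
    /// Whether this allocation is at a safepoint and may therefore block for GC.
    pub at_safepoint: bool,
    /// Whether this allocation may call Collection::out_of_memory on failure.
    pub allow_oom_call: bool,
}

impl Default for AllocationOptions {
    fn default() -> Self {
        Self {
            allow_overcommit: false, // documented default (see the doc comments quoted later)
            at_safepoint: true,      // documented default (see the doc comments quoted later)
            allow_oom_call: true,    // assumed default; not stated explicitly in this thread
        }
    }
}
```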

@wks wks mentioned this pull request Oct 10, 2025
wks added 2 commits October 13, 2025 17:54
We remove OnAllocationFail and add three boolean fields to
AllocationOptions.
@wks wks force-pushed the feature/overcommit-still-triggers-gc3 branch from 59ff447 to fa47af7 Compare October 13, 2025 10:04

wks commented Oct 13, 2025

After this PR, the decision tree becomes the following (assuming this is a mutator thread and GC is already initialized):

  • Does the allocation option allow eager polling?
    • If yes, poll, and move on.
    • If no, just move on.
  • Is any of the following true? (a) The poll above didn't trigger GC (consider it "not triggered" if we skipped polling), or (b) the allocation option allows over-commit.
    • If yes, try to get pages from the page resource, and move on.
    • If no, just move on.
  • Have we got pages from the page resource?
    • If yes, do the mmapping and return the address.
    • If no, are we at safepoint?
      • If yes, then
        • Have we tried getting new pages from the page resource?
          • If yes, force a GC, and move on.
          • If no, just move on.
        • block for GC.
        • return NULL.
      • If no, then return NULL immediately.

The control flow is more linear than before, with three steps, each using one boolean option.
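
Below is a self-contained, hypothetical sketch of that three-step flow. The helpers poll, get_new_pages, force_gc and block_for_gc are illustrative stand-ins rather than real mmtk-core internals, and the three booleans correspond to the options in the table that follows.

```rust
// A hypothetical sketch of the three-step control flow, not the real
// Space::acquire implementation.
fn acquire_sketch(
    eager_polling: bool,
    allow_overcommit: bool,
    at_safepoint: bool,
    pages: usize,
) -> Option<usize> {
    // Step 1: poll eagerly if the option allows it.
    let gc_triggered = eager_polling && poll(pages);

    // Step 2: try the page resource unless a GC was triggered and
    // over-committing is not allowed.
    let tried = !gc_triggered || allow_overcommit;
    let start = if tried { get_new_pages(pages) } else { None };
    if start.is_some() {
        return start; // the real code would mmap the pages and return the address
    }

    // Step 3: only an allocation at a safepoint may force a GC and block.
    if at_safepoint {
        if tried {
            force_gc(); // we actually tried to get pages and failed
        }
        block_for_gc();
    }
    None // corresponds to returning a null address
}

// Stubs so the sketch compiles; they do not model real GC behavior.
fn poll(_pages: usize) -> bool { false }
fn get_new_pages(_pages: usize) -> Option<usize> { None }
fn force_gc() {}
fn block_for_gc() {}
```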

By combining the three options, we can replicate the behaviors of the previous OnAllocationFail variants.

Variant        eager_polling   allow_overcommit   at_safepoint
RequestGC      true            false              true
ReturnFailure  true            false              false
OverCommit     false           true               false

We can make a new combination that polls (scheduling GC in the background) and over-commits at the same time, postponing blocking for GC to the next safepoint (see the usage sketch after the table below). This can be useful for VMs where allocation never happens at safepoints.

New behavior               eager_polling   allow_overcommit   at_safepoint
both poll and overcommit   true            true               false
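
Using the sketch above, this combination would be requested like so (values taken from the table; the helper and its page count are illustrative):

```rust
// Poll so a GC can be scheduled in the background, still over-commit to
// satisfy this allocation, and defer blocking for GC to the next safepoint.
fn poll_and_overcommit_example() -> Option<usize> {
    acquire_sketch(
        /* eager_polling    */ true,
        /* allow_overcommit */ true,
        /* at_safepoint     */ false,
        /* pages            */ 8, // example page count
    )
}
```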

But I wonder whether we can remove the eager_polling option (i.e. always make it true). I can't think of any use case where we don't want to poll. Polling only affects GC threads in the background. Even if GC is not initialized at this time, GC workers will be able to start the first GC immediately after GC is initialized. So it seems harmless to poll all the time.

@wks wks marked this pull request as ready for review October 14, 2025 07:54
@wks wks requested a review from qinsoon October 14, 2025 07:54

wks commented Oct 15, 2025

We discussed this in today's meeting. We should either

  • remove eager_polling (making the first poll compulsory, unless the thread is not a mutator or GC is not enabled), or
  • allow disabling the second poll (the one after trying to get pages from the page resource and failing), too.

I am in favor of removing eager_polling. I added eager_polling so that the user could replicate the old OnAllocationFail::OverCommit behavior. But we think we can change the behavior as long as the new behavior is more reasonable. As I mentioned before, the GC is scheduled in the background without affecting the execution of the mutator, and I can't think of any reason why the mutator would try not to trigger GC. If it needs "critical section" semantics, we have a separate issue discussing this: #1398

The only thing it may affect is NoGC, which panics immediately in NoGC::schedule_collection. So eager polling may cause the process to panic earlier than before if over-committing is allowed. I think this is actually reasonable, because otherwise the program would simply ignore the heap size if it always over-commits.

Member

@qinsoon qinsoon left a comment

Other than #1400 (comment), this PR looks good to me.

@@ -1,6 +1,8 @@
// GITHUB-CI: MMTK_PLAN=NoGC
Member

Why is this test only for NoGC?

Collaborator Author

Well, it doesn't have to. The test is about the behavior of allocation after exceeding the heap size, and it is not about the GC. So it doesn't really matter which plan it is. But I changed it to "all" just in case any plan triggers GC differently (mainly ConcurrentImmix).


wks commented Oct 15, 2025

I removed eager_polling and added a migration guide.

Member

@qinsoon qinsoon left a comment

The PR looks good to me.

There are just some minor points about the documentation -- I think it currently over-specifies the behavior of the new boolean flags. It’s good to be specific and precise, but the docs shouldn’t expose internal implementation details or define behaviors that are not controlled by the flag.

Comment on lines 35 to 49
/// Whether over-committing is allowed at this allocation site.
///
/// **The default is `false`**.
///
/// If `true`, the allocation will still try to acquire pages from page resources even when a GC
/// is triggered by the polling.
///
/// If `false`, the allocation will not try to get pages from the page resource as long as a
/// GC is triggered.
///
/// Note that MMTk lets the GC trigger poll before trying to acquire pages from the page
/// resource. This gives the GC trigger a chance to trigger GC if needed. `allow_overcommit`
/// does not disable polling, but only controls whether to try acquiring pages when GC is
/// triggered.
pub allow_overcommit: bool,
Member

I think the doc is way more detailed than necessary.

None of these is related to 'over-commit':

  • MMTk will acquire pages
  • MMTk uses page resources

'Overcommit' only means one thing: MMTk may go beyond the specified heap size, in order to satisfy this allocation request. Additionally, MMTk still triggers GC when it overcommits memory.

Collaborator Author

Yes. I'll simplify the description and make the point that "when over-committing, it may allocate beyond the heap size".

Comment on lines 51 to 67
/// Whether the allocation is at a safepoint.
///
/// **The default is `true`**.
///
/// If `true`, the allocation is allowed to block for GC, and call [`Collection::out_of_memory`]
/// when out of memory. Specifically, it may block for GC if any of the following happens:
///
/// - The GC trigger polled and triggered a GC before the allocation tries to get more pages
/// from the page resource, and the allocation does not allow over-committing.
/// - The allocation tried to get more pages from the page resource, but failed. In this
/// case, it will force a GC.
///
/// If `false`, the allocation will immediately return a null address if the allocation cannot
/// be satisfied without a GC. It will never block for GC, never force a GC, and never call
/// [`Collection::out_of_memory`]. Note that the VM can always force a GC by calling
/// [`crate::MMTK::handle_user_collection_request`] with the argument `force` being `true`.
pub at_safepoint: bool,
Member

Same here. at_safepoint means MMTk may block the thread for GC in this allocation request. These are not related to at_safepoint:

  • MMTk may call out_of_memory. MMTk calls out_of_memory when it runs out of memory, which is unrelated to at_safepoint. Before we have a specific flag like allow_oom_call, there is no definition of when MMTk may call out_of_memory -- it is an implementation detail.
  • The reasons why MMTk may block for GC are implementation details.
  • handle_user_collection_request is unrelated. It is a separate API, and is not related to alloc_with_options.

Collaborator Author

We still have a method allow_oom_call. It previously returned true only for OnAllocationFail::RequestGC. Maybe we should add another option, AllocationOptions::allow_oom_call, for that.

handle_user_collection_request is relevant because if alloc_with_options(at_safepoint=false) cannot force a GC, and it can't satisfy the allocation request without a GC, the user would trigger a GC manually instead. Currently, it mimics the behavior of OnAllocationFail::RequestGC: it forces a GC when it is at a safepoint and GC is initialized.

Alternatively, we can make "forcing GC" (i.e. the second poll() invocation with space_full = true) unstoppable, too. That is, as long as we fail to get pages from the page resource, it will force a GC, but it may return null if it is not at a safepoint. This will slightly change the control flow. One concern is what will happen if GC is not initialized, we failed to get pages from the page resource, and we are not at a safepoint. Should it panic immediately, or should it return null? @qinsoon what do you think of this?

Member

My opinions on this are:

  1. Binding users do not need to know about the 'forced GC'. There is little point in mentioning it.
  2. at_safepoint only makes sure the thread will not be blocked in this call for GC. It does not do anything more than that.
  3. Whether we 'force' GC after a failed allocation is an implementation detail.
  4. I personally think we probably still want to 'force' the GC after a failed allocation, but not block the current thread. In this case, at_safepoint only changes the behavior of blocking, and does not change the behavior of GC triggering.

One concern is what will happen if GC is not initialized, we failed to get pages from the page resource, and we are not at a safepoint. Should it panic immediately, or should it return null?

With at_safepoint=true, the behavior is panic. We could keep that behavior with at_safepoint=false. at_safepoint doesn't need to change that behavior.

Collaborator Author

I updated the documentation so that at_safepoint no longer guarantees anything other than not blocking for GC.

I also made allow_oom_call another option. But the current behavior of allow_oom_call is quite inconsistent. It only controls Space::handle_obvious_oom_request and LockFreeImmortalSpace::acquire, but not Allocator::alloc_slow_inline or util::memory::handle_oom_error, both of which call Collection::out_of_memory. I didn't change the current behavior, and I am leaving it to another pull request.

@wks wks added this pull request to the merge queue Oct 17, 2025
Merged via the queue into mmtk:master with commit 1ffa5b3 Oct 17, 2025
31 of 32 checks passed
@wks wks deleted the feature/overcommit-still-triggers-gc3 branch October 17, 2025 02:34